Metric Indexes for Approximate String Matching in a Dictionary
Abstract
We consider the problem of finding all approximate occurrences of a given string q, with at most k differences, in a finite database or dictionary of strings. The strings can be, for example, natural language words, such as the vocabulary of some document or set of documents. This has many important applications in both offline (indexed) and on-line string matching.

More precisely, we have a universe U of strings and a non-negative distance function d : U × U → N. The distance function is a metric if it satisfies (i) d(x, y) = 0 ⇔ x = y; (ii) d(x, y) = d(y, x); (iii) d(x, y) ≤ d(x, z) + d(z, y). The last item is called the “triangular inequality”, and is the most important property in our case. Many useful distance functions are known to be metric; in particular, edit (Levenshtein) distance is a metric, and we use it for d.

Our dictionary S is a finite subset of that universe, i.e. S ⊆ U. S is preprocessed in order to answer range queries efficiently: given a query string q, we retrieve all strings in S that are close enough to q, i.e. the set {u ∈ S | d(q, u) ≤ k} for some k. To solve the problem, we build a metric index over the dictionary and use the triangular inequality to prune the search. This is not a new idea; a huge number of different indexes have been proposed over the years, see [2] for a recent survey.

An example of such an index is the Burkhard-Keller tree [1], which builds a hierarchy as follows. Some arbitrary string p ∈ S (called the pivot) is chosen for the root of the tree. Child number e is built recursively from the set S_e = {u ∈ S \ {p} | d(p, u) = e}. This is repeated until only one string, or in general b strings (a bucket), are left; these are stored in the leaves of the tree. The tree has O(n) nodes, where n = |S|, and the construction requires O(n log n) distance computations on average. The search with the query string q and range k first evaluates the distance d(q, p), where p is the string in the root of the tree.
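Every step of the index relies on evaluating d. As a point of reference, the edit distance can be computed with the textbook two-row dynamic program. The sketch below is an illustration only, not the authors' implementation; the function name `edit_distance` and the sample word list are assumptions introduced here. It also spot-checks the three metric axioms (i)-(iii) on the sample.

```python
from itertools import product


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical sample words, used only to spot-check the metric axioms.
words = ["book", "books", "cake", "boo", "cape", "cart"]
for x, y, z in product(words, repeat=3):
    assert (edit_distance(x, y) == 0) == (x == y)                            # (i)
    assert edit_distance(x, y) == edit_distance(y, x)                        # (ii)
    assert edit_distance(x, y) <= edit_distance(x, z) + edit_distance(z, y)  # (iii)
```

Each call costs O(|x|·|y|) time, which is why the index complexities are stated in numbers of distance computations.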
If d(q, p) ≤ k, then p is put into the output list. The search then recursively enters each child e such that d(q, p) − k ≤ e ≤ d(q, p) + k; by the triangular inequality, no string u with d(q, u) ≤ k can lie in any other child. Whenever the search reaches a leaf, the strings in the stored bucket are compared directly against q. The search requires O(n^α) distance computations on average, where 0 < α < 1.

Another example is the Approximating Eliminating Search Algorithm (AESA) [4], which is an extreme case of pivot-based algorithms. This time there is no hierarchy; the data structure is simply a precomputed matrix of all the n(n−1)/2 distances between the n strings in S. The space complexity is therefore O(n^2), and the matrix is computed with O(n^2) edit distance computations. This makes the structure highly impractical for large n. The benefit comes from search
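The Burkhard-Keller tree described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the names `BKNode`, `build_bk_tree` and `range_search` are hypothetical, the pivot is simply the first string of each subset (the text only says "arbitrary"), and the sample dictionary is made up.

```python
from typing import Dict, Iterable, List, Optional


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


class BKNode:
    """Internal node (pivot plus children keyed by distance) or leaf (bucket)."""

    def __init__(self, pivot: Optional[str] = None,
                 bucket: Optional[List[str]] = None) -> None:
        self.pivot = pivot
        self.bucket = bucket  # a list for leaves, None for internal nodes
        self.children: Dict[int, "BKNode"] = {}


def _build(strings: List[str], b: int) -> BKNode:
    if len(strings) <= b:  # at most b strings left: store them in a leaf
        return BKNode(bucket=strings)
    pivot, rest = strings[0], strings[1:]  # assumption: first string is pivot
    node = BKNode(pivot=pivot)
    groups: Dict[int, List[str]] = {}
    for u in rest:  # S_e = {u in S \ {p} | d(p, u) = e}
        groups.setdefault(edit_distance(pivot, u), []).append(u)
    for e, group in groups.items():
        node.children[e] = _build(group, b)
    return node


def build_bk_tree(strings: Iterable[str], b: int = 1) -> BKNode:
    """Build the tree over the distinct strings, with bucket size b >= 1."""
    return _build(list(dict.fromkeys(strings)), b)  # drop duplicates, keep order


def range_search(node: BKNode, q: str, k: int,
                 out: Optional[List[str]] = None) -> List[str]:
    """Collect every stored string u with d(q, u) <= k."""
    if out is None:
        out = []
    if node.bucket is not None:  # leaf: compare the bucket directly against q
        out.extend(u for u in node.bucket if edit_distance(q, u) <= k)
        return out
    dq = edit_distance(q, node.pivot)
    if dq <= k:
        out.append(node.pivot)
    for e, child in node.children.items():
        if dq - k <= e <= dq + k:  # only these children can hold matches
            range_search(child, q, k, out)
    return out


# Hypothetical dictionary, for illustration only.
words = ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]
tree = build_bk_tree(words)
print(sorted(range_search(tree, "bool", 1)))  # ['boo', 'book', 'boon']
```

The pruning test is the only place the triangular inequality is used: a child e outside the window [d(q, p) − k, d(q, p) + k] is skipped without computing any distance to the strings below it.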
Similar resources
Finding Approximate Matches in Large Lexicons
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures an...
Approximate String Matching: Theory and Applications (La Recherche Approchée de Motifs : Théorie et Applications)
Approximate string matching is a fundamental and recurrent problem that arises in most computer science fields. This problem can be defined as follows: let D = {x1, x2, ..., xd} be a set of d words defined on an alphabet Σ, let q be a query also defined on Σ, and let k be a positive integer. We want to build a data structure on D capable of answering the following query: find all words i...
Fast Approximate String Matching in a Dictionary
A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that on-line search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in tex...
Approximate String Matching ? Edgar
We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us finding the R occurrences of ...
Approximate string matching algorithms for limited-vocabulary OCR output correction
Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian a...
Publication date: 2004